Lower respiratory tract microbiome composition and community interactions in smokers

The lung microbiome impacts on lung function, making any smoking-induced changes in the lung microbiome potentially significant. The complex co-occurrence and co-avoidance patterns between the bacterial taxa in the lower respiratory tract (LRT) microbiome were explored for a cohort of active (AS), former (FS) and never (NS) smokers. Bronchoalveolar lavages (BALs) were collected from 55 volunteer subjects (9 NS, 24 FS and 22 AS). The LRT microbiome composition was assessed using 16S rRNA amplicon sequencing. Identification of differentially abundant taxa and co-occurrence patterns, discriminant analysis and biomarker inferences were performed. The data show that smoking results in a loss in the diversity of the LRT microbiome, change in the co-occurrence patterns and a weakening of the tight community structure present in healthy microbiomes. The increased abundance of the genus Ralstonia in the lung microbiomes of both former and active smokers is significant. Partial least square discriminant and DESeq2 analyses suggested a compositional difference between the cohorts in the LRT microbiome. The groups were sufficiently distinct from each other to suggest that cessation of smoking may not be sufficient for the lung microbiota to return to a similar composition to that of NS. The linear discriminant analysis effect size (LEfSe) analyses identified several bacterial taxa as potential biomarkers of smoking status. Network-based clustering analysis highlighted different co-occurring and co-avoiding microbial taxa in the three groups. The analysis found a cluster of bacterial taxa that co-occur in smokers and non-smokers alike. The clusters exhibited tighter and more significant associations in NS compared to FS and AS. Higher degree of rivalry between clusters was observed in the AS. The groups were sufficiently distinct from each other to suggest that cessation of smoking may not be sufficient for the lung microbiota to return to a similar composition to that of NS.


INTRODUCTION
Smoking, a known contributor to the development of human disease, can affect the resident microbial communities in different body niches [1]. Several mechanisms have been proposed, including smoking-related immunosuppression [2], enhancement of biofilm formation for certain species [3,4] and species selection by the effect of local oxygen tension [1]. Further, the oral cavities and upper airways have direct contact with smoking chemicals, heat and the microbes contained in cigarettes, all of which can play a role in altering the microbiome composition [1]. Therefore, it is not surprising that the microbial profiles of the oral and nasopharyngeal regions showed marked differences between active (AS), former (FS) and never smokers (NS) in previous studies [5]. It has been postulated that the smokingrelated dysbiosis in the oral microbiome may account for the increased prevalence of respiratory tract complications seen in smokers [6]. Significant differences in the composition of the subgingival microbiome in periodontitis have been described between smokers and nonsmokers [7]. The salivary microbial signature at the genera level is in fact so well differentiated that biomarkers discovered using linear discriminant analysis effect size (LEfSe) can be used to classify smokers and nonsmokers [8]. These differences may change with smoking cessation, as the significant differences in weighted UniFrac, unweighted UniFrac and Bray-Curtis distances observed in the overall tongue microbiome compositions between current and never smokers are not seen between former and never smokers [9]. The effects of smoking on the lower respiratory tract (LRT) microbiome are less clear.
It is now well established that the LRT is not sterile and harbours distinct microbiota from the upper respiratory tract [10][11][12]. A significant portion of the healthy lung microbiome is similar to that of the oral and nasopharyngeal cavities, most likely by microaspiration [13,14]. Accordingly, studies using bronchoalveolar lavage (BAL) sampling have shown that the LRT microbial communities in healthy individuals are dominated by bacteria from the phyla Firmicutes, Proteobacteria, Actinobacteria and Bacteroidetes [15]. Unique LRT-specific organisms occur at very low levels but are variable between individuals and may change regionally depending on the lung microarchitecture [13,16,17]. Studies on the LRT bacterial composition in cases of smokingassociated diseases such as chronic obstructive pulmonary disease (COPD) [16,[18][19][20], pulmonary fibrosis [21] and lung cancer [22,23] have shown distinct differences compared with non-diseased subjects. It is therefore possible that microbial dysbiosis affects the pathogenesis of the disease [24,25], which highlights that a more in-depth understanding of the effects of smoking on the resident microbial communities of the lung is needed.
The influence of smoking on the LRT microbial composition was studied by Morris et al. [25]. Using BAL samples from 45 nonsmokers and 19 current smokers, they described significant differences in the oral microbial communities of nonsmokers compared with smokers, but their analysis did not show significant differences in the lung microbiomes of the two cohorts [25]. The investigation by Morris et al. was confined to surveying the differential abundances of various bacterial taxa. Further, their study did not include any former smokers. In this study, we sought to compare the LRT microbiome profiles of AS, FS and NS using standard ecological parameters and analytical tools to better characterize the bacterial communities of the lung. By assuming that frequently interacting bacterial taxa must co-occur, we used co-occurrence analysis to obtain a glimpse into the possible 'social network' [26] among the bacterial taxa in the lung microbiome.

Participant selection
The study presented here is a prospective single-centre cohort study approved by the University of Miami's Institutional Review Board. Volunteer subjects had to be >40 years old, either nonsmokers or smokers of at least 10 pack-years in their lifetime. Former smokers had to be abstinent of tobacco use for at least 12 months, whereas active smokers had to have smoked at least one cigarette within 3 days of enrolment. Participation exclusion criteria included pulmonary malignancies or prior thoracic surgery, use of antibiotics, systemic steroids or immunosuppressant agents within 12 weeks, chronic use of inhaled medications for any lung condition, or presence of acute upper respiratory infection within the previous 4 weeks. In short, participation exclusion criteria included the diagnosis of chronic lung disease with the exception of COPD (including chronic bronchitis). Note that the definitions of COPD and chronic bronchitis used for clinical purposes are arbitrary and therefore we did not exclude subjects with these clinical diagnoses because the intention of the study was to assess the interaction of smoking and the microbiome. In addition, we made sure that subjects were in a good clinical condition to undergo a bronchoscopy as per the physician investigators.
All subjects underwent complete pulmonary function testing and a detailed clinical and demographic questionnaire. The sampling procedure was standardized for all subjects, using a flexible bronchoscope (Olympus, PA, USA) via the nasal passage under moderate sedation in accordance with standard clinical practice. After the anaesthetization of vocal cords and trachea with 1 % lidocaine, the bronchoscope was directly inserted into the right middle lobe bronchus and wedged. Bronchoalveolar lavage fluid (BAL) was collected by instilling three 20 ml aliquots of normal saline, immediately aspirated using a collector trap inserted on the suction port of the bronchoscope. The BAL was then centrifuged, and the supernatant stored in 1 ml aliquots at −80 °C until analysis.

DNA purification
Total DNA was extracted from the BAL samples using a FastDNA SPIN kit for soil (MP Biomedicals, Solon, OH, USA) with a slightly modified protocol. Briefly, the BAL samples were thawed, added to Lysing Matrix E tubes, and homogenized in the FastPrep (MP Biomedicals, Solon, OH, USA) instrument. After centrifugation and supernatant separation, the protein was precipitated out using protein precipitation solution (PPS). To isolate the metagenomic DNA, the Binding Matrix solution was added to the supernatant. The mixture was then added to a SPIN Filter and Catch Tube apparatus and centrifuged. The contents of the catch tube were discarded. The column was washed with salt/ethanol wash solution (SEWS-M), centrifuged and air-dried for 5 min. The metagenomic DNA was eluted using 100 µl of DNase/pyrogen-free water (DES). To increase DNA yield, the tube was incubated for 5 min at 55 ˚C in a heat block before centrifugation. The DNA was quantified via UV spectrophotometry on the Synergy HT Multi-Detection Microplate Reader (BioTek Corporation, Winooski, VT, USA).
All amplifications were performed on either a PTC-200 Peltier Thermal Cycler (MJ Research, Waltham, MA, USA) or a Touchgene Gradient PCR Thermal Cycler (Techne, Burlington, NJ, USA). The amplification parameters used were as follows: 95 °C for 5 min initial denaturation and HotStarTaq Plus activation, followed by 35 cycles of denaturation at 95 °C for 20 s, primer annealing at 50 °C for 1 min and an extension at 72 °C for 1 min, followed by a final elongation step at 72 °C for 10 min. Total DNA from Pseudomonas aeruginosa PAO1 was used as a positive control. PCR reactions without the template served as a negative control. The PCR amplifications were run either in duplicate or triplicate, depending on the DNA yield.

Gel electrophoresis and DNA purification
Gel electrophoresis was performed using a 2 % (weight/volume) 0.5× Tris-Borate-EDTA (TBE) agarose gel (MG Scientific JT Baker Agarose, Pleasant Prairie, WI, USA). A 100 bp DNA ladder was co-loaded as a size reference. Gels were stained with Invitrogen SYBR Safe DNA Gel Stain (Life Technologies, Grand Island, NY, USA) and viewed under UV trans-illumination on a KODAK Gel Logic 2200 Imaging System (Carestream Health, Rochester, NY, USA). The appropriate bands were excised and purified using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, WI, USA) and the provided protocol. In brief, the gel slices were melted and dissolved in Membrane Binding Solution (Promega, Madison, WI) and then added to an SV Minicolumn assembly and centrifuged. The bound DNA was washed several times using Membrane Wash Solution and then eluted using 50 µl of nuclease-free water.

Library construction
The resulting amplicons were treated as shotgun fragments for library construction (ICBR Core Facilities, University of Florida, Gainesville, FL, USA). The protocol used was a slight modification of the method described in the user's manual provided with the Rapid Library Construction kit (Roche Applied Science, Penzberg, Germany; Rapid Library Preparation Manual, GS FLX Titanium Series, October 2009, revised January 2010). The final library was quantitated in the QUBIT fluorometer (Invitrogen/ Thermo Fisher, Waltham, MA, USA) and the BioAnalyzer (Agilent, Santa Clara, CA, USA). The conditions and procedures for emulsion PCR and bead enrichment are as described in the Roche protocols. A total of 340 000 enriched sequencing beads were loaded on a 1/8 region of a picotitre plate for sequencing on the Roche 454 GS FLX instrument.

Image processing
Raw image signals were processed offline using the GS Run Browser application that resides within the computer connected to the instrument (software version 2.0). Amplicon libraries were prepared using the 'standard' signal-processing pipeline. The process includes a series of normalization, correction and quality-filtering algorithms, and the application threads the remaining (high-quality) signals into 'flowgrams' for each well (reads). Base calls with associated quality scores for the individual reads were produced and output as standard flow gram format files. A library sample that was successfully sequenced on a 1/8 picotitre plate region (Titanium Chemistry) typically yielded 70 000-100 000 reads, 300-400 bp long, and an average quality score of ≥30.

Operational taxonomic units (OTUs)
A basic workflow was modelled on Costello stool analysis [28]. An oligos file was written manually for each sequencing pool to group reads by subject. Pools contained a maximum of 12 samples with each group (AS, FS and NS) randomly distributed among the pools, and each sample with a unique barcode. One base error was allowed per barcode. Primers were not removed from sequences. Cutoffs were set at 25 for the quality score and 50 bp for minimum length. Sequences were aligned using the Silva Bacteria Alignment Database as a reference [29] and the Needleman-Wunsch alignment algorithm with a k-mer size of 8 [30]. The reverse complement was used for the alignment if 50 % or more of the reads were removed during the original alignment, indicating that the reads may have been flipped. The chimaeras were removed using UCHIME [31]. The classification was performed with a k-mer size of 7 and a confidence cutoff of 60. Reads were clustered into OTUs based on sequence similarity (distance <0.03%) and classified against the silva database (v. 132) [29]. Each read for each subject was mapped to an OTU. Abundance for each OTU for each subject was measured as the number of reads mapped to that particular OTU.

Decontamination
Reads below 99 % confidence at the kingdom level (bacteria or archaea) were considered to be contaminated and removed, as were all eukaryotes, mitochondria, archaea and chloroplasts. An early analysis of read counts indicated contamination by members of the genus Halomonas. This was repeatedly confirmed by PCR performed in saline washes of the sterilized bronchoscopes (data not shown). Reads mapped to the Halomonas genome were manually removed from further analysis.

Normalization
The raw read counts were normalized by scaling them to the range 0 through 1, which was achieved by dividing each entry by the sum of all the entries for that subject. For each group, OTUs with non-zero normalized counts in less than 50 % of the members were eliminated. This means that the set of OTUs retained could be different for each group. As this would create bias in comparative and biomarker analyses (LEfSe, DESeq2, PLSDA), we kept OTUs with non-zero normalized counts in more than 50 % of one of the groups in all groups for those analyses, creating a uniform OTU set.

Bioinformatic analyses Compositional analyses
Alpha diversity was measured using Chao richness [32] and inverse Simpson diversity [33] indices. The DESeq2 [34] algorithm was used to find distinguishing taxa (based on abundance) for each group.

Biomarker analyses
The linear discriminant analysis effect size or LEfSe analysis [35] was used for the identification of potential biomarkers (P<0.05, LDA effect size >1) to produce a set of taxa with relative abundances most significantly different between each group and the other two. This analysis determines the organisms and clades that best differentiate the groups under study (AS, FS, NS) by combining effect relevance with biological consistency. By providing an effect size, this method can also rank the significance of Relative abundance at the level of phyla. The relative abundance was plotted for never smokers, former smokers and active smokers at the phylum level. The columns correspond to the subjects studied. the biomarkers (i.e. taxa). Finally, the visualizations provided by LEfSe places these biomarkers on taxonomic trees, allowing for improved biological interpretations. LEfSe can be especially useful for discovering differences in taxa with relatively low relative abundance, and that is not necessarily apparent in the compositional bar graph.

Clustering analyses
Principal coordinate analysis (PCoA) [36] was used to analyse overall data clustering, with unweighted Unifrac as a measure of beta diversity within and between cohorts. Three-dimensional visualizations of the results were produced using Emperor (0.9) and ggplot2 (3.3.3). Partial least squares differential analysis (PLS-DA [37]) was used to determine the degree of separation between the three cohorts of samples. Markov clustering (MCL [38]) was used to cluster nodes in the CoN, and these same clusters were used in the heatmaps to illustrate tightness.  *Microbes with species and strains associated with normal oral flora [45]. †Microbes with species and strains associated with normal lung flora [11]. AS, active smoker; blue, FS/NS vs AS; FS, former smoker; NS, never smoker; Pink, AS/FS vs NS; yellow, AS/NS vs FS.

Network analyses
Proclivities of OTUs to co-occur in each of the subject groups were investigated using pairwise correlations computed using SparCC (P<0.05) [39]. SparCC iteratively estimates correlation using log ratio transformations of relative abundances, and operates under the assumption that microbial data are sparse. Network diagrams were constructed from pairwise correlation matrices using Cytoscape [40] to better visualize the complex interactions between OTUs in each of the subject groups. Node sizes were set to reflect the relative abundance of each OTU. For each group, the correlation coefficients were calculated for each pair of OTUs, and the thickness of the edge was set to reflect the magnitude of the correlation value. The resulting matrix was then represented by a network that integrated many vital pieces of information in a single plot to help visualize relationships between OTUs present in each specific group using Cytoscape [40], and were visualized using the Fruchterman-Reingold algorithm [41].

Cohort
The basic demographic profiles of the 55 subjects enrolled are shown in Table 1. All groups had similar distributions for age, gender and lung function parameters. All 46 smokers had the same smoking exposure in pack-years, with the FS quitting an average of 10.2±9.7 years before participation (range 1.5 to 34 years). As expected, the smoker populations, FS and AS, have reduced forced vital capacity (FVC), forced expiratory volume at second 1 (FEV 1 ) and diffusing capacity for carbon monoxide (D L CO), but the differences were not significant by analysis of variance (ANOVA) (data not shown). The 24 FS and the 22 AS differed significantly (χ 2 test, P<0.05) only in a higher prevalence of clinical chronic bronchitis in the AS group. Where appropriate, values are provided as an average value ±standard deviation along with the median value within square brackets.

Microbial abundance and smoking status
On average, more than 3600 reads with an average length of 479 nucleotides were recorded in each subject's BAL, which allowed the characterization of approximately 400 OTUs per subject. Table 2 shows the statistics for each group, including the number of OTUs recovered, unclassifiable reads (k-mer size=7, confidence cutoff=60, cluster sequence similarity distance cutoff=0.03 %), and richness and diversity indices.
The LRT compositional profile in all three study groups, namely NS, AS and FS, is represented in Fig. 1. The NS profile is fairly balanced between the dominant phyla Proteobacteria (39 %), Firmicutes (31 %) and Bacteroides (21 %), followed by Fusobacteria and Actinobacteria with relative abundance in the single digits (7 and 2 %, respectively). The FS group showed a marked increase in Proteobacteria (61 %) at the expense of Firmicutes (19 %) and Bacteroides (dropping all the way to 4 %, even lower than Fusobacteria). This trend is also true in the active smokers, with Proteobacteria composing an overwhelming 75 % and Firmicutes dropping to 11 %. The distribution of recovered OTUs for each subject group is visualized in Fig. 3. These Krona images allow intuitive exploration of relative abundances within the complex hierarchies of metagenomic classifications [42]. The genus-level analyses (Fig. 3)  show that most of the increase in Proteobacteria in FS and AS compared to its abundance in NS was contributed by the genus Ralstonia, which increased from 2 % (NS) to 21 % in FS and 28 % in AS. From the phylum Firmicutes, the genus Streptococcus (reduced from 18 % in NS to 14 and 9 % in FS and AS, respectively) and the genus Veillonella (reduced from 8 % in NS to 1 and 2 % in FS and AS, respectively), in addition to the genus Prevotella from the phylum Bacteroidetes (15 % in NS to 2 and 5 % in FS and AS, respectively), were the taxa with the highest reductions in relative abundance. Each of these are known members of healthy lung microbiota [11]. The genus Propionibacterium, from the phylum Actinobacteria, showed a small increase from 0.8 % in NS to 3 % in FS and AS, respectively.
Using DESeq2 [34] we conducted a pairwise comparison of individual OTUs to determine OTUs found to be significantly more abundant in each subject group over one or both other groups (P<0.05), summarized in Table 3. P-values were corrected for multiple comparisons using the Benjamini-Hochberg method [43]. Each column is labelled by a cohort (NS, FS, AS), and contains a list of the taxa that are differentially abundant in that cohort. Interquartile range calculations are performed for each sample based on taxa relative abundance, and taxa are subsequently ordered by their median quartile over all samples in the corresponding cohort. Ralstonia.(01-03) were upper-quartile taxa in the AS and FS profiles that were differentially abundant when compared to NS in both profiles, offering further support for the observed behaviour in the Krona plots, and also that dominant Proteobacteria was a distinguishing characteristic of AS and FS compared to NS. A subsequent run of DESeq2 at the phylum level did turn up Proteobacteria as distinguishing AS and NS (but interestingly, not FS and NS), along with Acidobacteria. Additionally, two other top-quartile Proteobacteria (Burkholderia.01-02) also distinguished AS from NS. When viewing the NS profile, far more upper-quartile taxa were found as distinguishing from AS [including a Streptococcus, Fusobacterium, Veillonella (distinguished from both other groups) and Haemophilus], while the lower-quartile taxa tended to distinguish from FS. The three top-quartile NS taxa: Streptococcus.03, Fusobacterium.02 and Veillonella.03 were also discovered by ANCOM with bias correction (ANCOM-BC [44]) as distinguishing for NS (it did not uncover any as distinguishing for FS or AS). The NS profile had a total of eight taxa that have previously been reported as members of healthy oral [45] and lung [11] microbiomes. The FS profile had two of these taxa, and both distinguished from AS. AS had one distinguishing from FS.
Ecological analytical measures routinely used in human microbiome studies were also used for analysis. OTU richness and diversity were computed using the Chao [32] and inverse Simpson [33] diversity indices, respectively. NS exhibited markedly higher average richness (ANOVA [46], P=1e-9) and diversity (ANOVA, P=0.00075) compared to FS and AS (Fig. 4a, b), within errors for richness, although diversity had a wider error bar. This was more evident when the subjects were ordered by decreasing richness (Fig. 4c), suggesting that NS had higher richness. However, the diversity measured using the inverse Simpson index showed only a moderate correlation with the richness values and the smoking status (Fig. 4c). Table 4 shows Pearson correlations between richness and diversity along with associated P-values across all samples, and within each group. Interestingly, the NS showed the highest (0.75) and most statistically significant (P=0.02) correlation between richness and diversity, with the FS showing a weaker correlation (0.4) that is still biologically significant (P=0.05). The AS experienced almost no correlation at all (0.09, P=0.69). Across the entire set, a statistically significant (P=3e-7) positive correlation of 0.63 was observed between richness and diversity. Faith's phylogenetic diversity index [47] also showed a higher value within errors for NS (ANOVA, P=2e-8) compared to AS and FS (Fig. 4d).

Biomarker analyses
Next, we used the LEfSe to identify the classes of bacteria contributing significantly to the differences observed between the three cohorts [35]. LEfSe even takes into account the low-abundance taxa in a sample used for the identification of potential biomarkers (P<0.05, LDA effect size >1) to produce a set of taxa with abundances most significantly different between the cohorts (Fig. 5). The cladogram (Fig. 5a) and the histogram with the LDA scores ( Fig. 5b) show taxa that are statistically and biologically differential  between the three groups. While our earlier compositional analysis provided feedback on differences between highly abundant taxa, LEfSe can be helpful for discovering less abundant taxa that tended to only occur in one of the three groups, and considers their taxonomic relationships. Interestingly, only the NS group came away with any distinguishing taxa, including the genera Prevotella, Veillonella, Haemophilus, Mogibacterium, Eubacterium, Lautropia, Megasphaera and Methylobacterium. Five of these -Prevotella, Veillonella, Haemophilus, Lautropia and Methylobacterium -were also uncovered by DESeq2 as distinguishing the NS samples. Eubacterium [45] has previously been reported as part of a healthy oral microbiome. We also see phyla -Firmicutes and Bacteroides, which makes sense given our earlier results with the box-and-whisker plots -where Proteobacteria was much more dominant in the AS and FS samples compared to NS at the expense of both Firmicutes and Bacteroides. Table 5 shows differentiating taxa identified using LEfSe, ordered by taxonomic classification. All of these were reported as biomarkers for the NS group.

Discriminant analyses
To determine if dimensionality reduction of the data points can lead to separation of the three groups, PCoA [36] was performed using unweighted Unifrac distance as the measure of beta diversity [48]. The separation between the three clusters ( Fig. 6a), the FS, AS and NS, was clear, supporting a claim of compositional differences in the LRT due to smoking. This separation was confirmed by PERMANOVA [49], with a P-value of 0.017. Fig. 6b supports that the NS group was more differentiated than the other two groups based on unweighted Unifrac distance, although more variance can be seen in the AS group, which was confirmed by BetaDisper [50], with an average distance-to-median of 0.36 compared to 0.29 for FS and NS. When we move to weighted Unifrac (Fig. 6c), the separation is not as clear, indicating that rarer members of the microbiome may potentially be more distinguishing between the three sample sets. Collectively, the genera reported by LEfSe for distinguishing NS samples (Prevotella, Veillonella, Haemophilus, Mogibacterium, Eubacterium, Lautropia, Megasphaera and Methylobacterium) comprise on average 19.1 % of the relative abundance of those samples, offering some additional support for this claim.
To understand the differential relative abundance of the OTUs within the three study groups, we utilized a bilinear factor model, the partial least square discriminant analysis (PLS-DA [37]) method, which finds two (or more) composite directions (variates) that best separate the groups (Fig. 7). Again, clear separations were observed between the NS and AS groups (Fig. 7a) and the NS and FS groups (Fig. 7b). The separations between the AS and FS groups were less clear ( Fig. 7c), also supported by the fact that the FS group showed overlap with both NS and AS clusters, when all three groups were analysed simultaneously (Fig. 7d). M-fold cross-validation (CV) errors were also reported in the legend, indicating the mean-squared error of prediction.

Correlation between OTUs
The matrix of correlations (computed using SparCC [39]) between the vectors of relative abundance values of OTUs within each subject group was analysed using a fast and scalable unsupervised cluster algorithm called the Markov cluster algorithm (MCL) [38]. A heatmap was constructed from the matrix with green and red colours for positive and negative correlations, respectively (Fig. 8). The bacterial OTUs were then clustered using the correlation values as the distance function. The heatmap was reorganized by permuting rows and columns such that those corresponding to the same cluster of OTUs were located next to each other. The bright green square regions along the diagonal represent clusters, and the brightness indicates the tightness of the associations. The output shows that there is detectable clustering of OTUs in each of the three cohorts. The NS heatmap reveals tighter clusters with one large one (Fig. 8a). The FS clusters (Fig. 8b) appear to be better formed compared to the AS group, where the clusters are smaller and more weakly associated (Fig. 8c, Table 6).

Bacterial co-occurrence networks
To estimate the ecological relationships [26,51] between the microbes in these samples, we constructed a microbial co-occurrence network (CoN) as described previously [26]. When visualized as a network (Fig. 9), the distinct clusters of OTUs that tended to co-occur together within each subject group were distinguishable. An overall comparison of the OTU networks showed that the NS clusters were tighter (thick and shorter edges; Fig. 9a). The OTU clusters observed in AS subjects were weaker (thinner and longer edges) (Fig. 9c), suggesting a weaker overall community structure. Quantification of OTU co-occurrence confirmed the 'tightness' in the NS clusters (correlation coefficient=0.79) compared to those in smokers (0.46 and 0.45 in FS and AS, respectively) ( Table 7). The number of social clubs was higher in NS (5) compared to FS (3) and AS (4). AS had two rival club pairs, while NS and FS each had only one (Table 7).  Table 6. Table 6. Distinctive OTU clusters (clubs) observed in each of the subject groups, and the average correlations within each cluster*. The number next to each cluster represents the mean strength of association between OTUs in that particular cluster, from 0 to 1 (strongest). Cluster colours correspond to those used in the networks (Fig. 9 Veilonella.03

Continued
The compositions and strength of associations of the most prominent OTU clusters detected in each subject group are detailed in Table 7 (with clusters named arbitrarily and not meant for comparison between subject groups). It is noteworthy that one cluster is shared across all three groups (cluster A), distinguished only by the presence of Staphylococcus in AS and NS groups, Escherichia/Shigella in the former group, and Corynebacterium in the NS group. Active and former smokers also share a similar cluster in their respective cluster B, distinguished by the strong Ralstonia presence. The largest group with the strongest correlation in NS, group C, is largely dominated by Prevotella and Actinomyces, plus a Streptococcus OTU. Taxa from these genera also co-occurred within the other two groups, just within smaller and less tight-knit communities. Rivalries in AS occur between clusters B (heavy Ralstonia) and C (heavy Prevotella/Streptococcus), and also clusters A (high phylum Proteobacteria and Propionibacterium) and D (two Actinomyces taxa, plus Prevotella, Streptococcus and Veillonella). The FS cluster B (also high Ralstonia) rivalled its cluster C (distinguished by Prevotella and Veillonella). The single rivalry in the NS network was between cluster A (Proteobacteria/Propionibacterium) and C (Prevotella/Actinomyces, and a single Streptococcus).
Ralstonia was earlier reported in our Krona charts as much more abundant in FS and AS groups compared to NS, and all three Ralstonia taxa were reported by DESeq2 as distinguishing FS and AS samples compared to NS. Here, we see a nearly identical Ralstonia-heavy cluster of taxa forming in the FS and AS networks, in both cases rivalling a cluster consisting of many previously reported healthy oral and lung microbes, including Streptococcus, Veillonella, Prevotella, Haemophilus and Fusobacterium. All five of these taxa were reported by DESeq as distinguishing the NS group (three by LEfSe as well), which leads to questions regarding the potential impact of Ralstonia in the FS and AS microbial ecosystems. One other important observation is that a small community of co-occurring bacteria (cluster A) were found to be common among all three groups. Possible explanations of this phenomenon could that these bacteria may be part of a 'core' microbiome playing an important functional role in the lung.
Note that as we performed 16S analysis using OTU clustering, we kept all of our analyses at the genus level, where this method has previously been illustrated to be most accurate. As a reference, however, we took the representative sequence of every OTU and ran blast [52] against the database of genomes from its corresponding genus. Table 8 shows the closest matching species and strain for every one of our taxa.

DISCUSSION
This study aims to understand the LRT microbiome's organization and explore changes that may relate to smoking status beyond looking at just the abundance and diversity. Burkholderia.03 *Average correlations for the edges between strongly rival clusters (clubs). Table 6. Continued Fig. 9. Construction of microbial co-occurrence networks. Proclivities of OTUs to co-occur (or co-avoid) in each of the subject groups, (a) NS (b) FS and (c) AS, are visualized as network diagrams constructed from pairwise correlation (SparCC [39], P=0.05) matrices using Cytoscape. Each node corresponds to an individual OTU. The sizes of the nodes are proportional to the relative abundance of the taxa they represent. The edges are coloured green or red depending on whether the taxa co-habit (positively correlated, as in a 'social club' [26]) or co-avoid (negatively correlated, as in 'rival clubs'). The thickness of an edge is indicative of the strength of the correlation, with stronger associations represented by thicker edges. The Fruchterman-Reingold algorithm [41] was used for visualization, making the location of a node with respect to its neighbours indicative of its co-occurrence strength with that neighbour. Nodes are coloured by clusters.

Never smokers Former smokers Active smokers
of oropharyngeal species [16,19]. Using a neutral model of community ecology, Morris et al. concluded that dispersal from the mouth is mostly responsible for the lung's microbial composition [25]. It has been reported that the upper airway microbiome resembles that of the oral cavity more than the microbiome of more distal airways, suggesting that there may be a gradient of aspirated bacteria [18]. BAL samples dominated by bacteria typically found in oral microbiomes are associated with an elevation of inflammatory markers [54]. Our study shows that healthy nonsmoker BALs have significantly higher relative abundances of several classic LRT microbes, such as Prevotella, Streptococcus and Veillonella (Fig. 3), as previously described [10,16,25].
Other known oral microbes [45] were also found, including Neisseria in former smokers and Haemophilus, Leptotrichia and Fusobacterium in nonsmokers. Some of our results matched those of Morris et al. [25] for nonsmokers, with elevated Prevotella, Streptococcus, Veillonella, Fusobacterium and Haemophilus (NS) and Neisseria (FS). Their study also found several other genera to be differentially abundant in nonsmokers' lungs (both former and never). Differences in our findings can be attributed to the choice of the region of the 16S rRNA gene utilized in recovering the metagenome [55]. Regardless of their origin, these bacterial species may be worth further exploration, as they can potentially play an important role as local immune system regulators, similar to commensal bacteria in other body niches [24, 56,57].

Role of Ralstonia
Several studies have addressed the impact of smoking on the microbiome of the nasopharyngeal and oral cavities, but only one specifically addressed the effect of smoking on the human LRT microbiome [25]. Morris et al. detected disruption of oral microbial communities but found no significant differences in the BAL microbiome or OTU profiles between nonsmokers and healthy smokers [25]. Our study indicates the smoking groups can be differentiated using general ecological measures such as richness and (to a lesser degree) diversity, as well as using PCoA, LEfSe and PLS-DA (Figs 5-7). In particular, Fig. 7 showed that the former smokers appeared to occupy an 'intermediate' state that showed overlap with both the active and never smokers, which illustrates that a more granular study for former smokers that accounts for variables such as the amount of time since quitting smoking, pack years, etc. might be relevant for future work.
In particular, the increased presence of multiple OTUs of Ralstonia in the LRT of active and former smokers, and the significant role they appear to play in the co-occurrence networks, is particularly noteworthy. Ralstonia OTUs have been noted as pathogens multiple times by the Centers for Disease Control and Prevention (CDC) [58][59][60] and have been implicated in infections in COPD [61] and cystic fibrosis patients [62]. Their increased relative abundance in the LRT of active and former smokers was at the expense of several core LRT taxa, including Streptococcus, Prevotella and Veillonella, as evidenced by the rival clubs in both networks between the Ralstonia-dominated cluster and clusters containing these core taxa. These co-avoidance relationships in the smokers' networks may be interpreted as a disruption of the strong community structure observed in the nonsmokers' network and may be signs of dysbiosis in the microbiomes of smokers' LRT.
An accurate characterization of the human microbiome is paramount to appreciating its role in human health and disease and understanding the complexity of microbial systems [63,64]. Microbes co-exist in communities with complex inter-species interactions, both positive and negative, that can affect the functional dynamics of their impact on the host via the virulence traits of some members [56,65,66]. The overall effects on the human host may represent the effects of the entire community rather than that of its individual members. Thus, the study of the human microbiome requires more in-depth microscale analysis to determine its complexity and variations with respect to health and disease, in order to provide a perspective at a macroscale level.

Co-occurrence networks
Cluster analyses of the co-occurrence networks reveal groups of bacterial OTUs that tend to occur together, as well as groups that tend to co-avoid. Active smoking clearly alters the relative abundance of specific bacterial taxa and is associated with disruptions in the strong co-occurrence relationships between taxa found in healthy individuals. AS exhibited alternative clustering of co-occurring bacterial taxa, several of which contained genera with opportunistic pathogens.

Shared microbes
In our co-occurrence networks we observe a potential 'core' group consisting of the following taxa: two OTUs from the genus Propionibacterium, and OTUs from Delftia, Pseudomonas and Stenotrophomonas. Possible explanations for this common cluster  Continued could be that these microbes play some housekeeping role in the lung or are involved in some processes and pathways unaffected by smoking.

Community structure changes
By integrating and visualizing all the positive and negative associations of individual OTU pairs in a network, we observe mutualistic clusters in all subject groups. The 'tighter' alliances seen in nonsmokers' LRTs may be indicative of greater eubiosis when compared to subjects exposed to tobacco smoke. We also found that several bacterial clusters exhibited a robust negative  Table 8. Continued correlation ('rival clubs') in smoking groups, suggesting co-exclusion of these taxa. Rivalry may thus represent group co-avoidance, perhaps a form of dysbiosis, suggesting the likelihood of subtypes of microbiomes within the groups, which can be better elicited with larger study cohorts. Rivalry may also represent driving forces for competitive exclusions, which may favour important pathogenetic species [63]. Overall, descriptions of these complex ecological relationships are crucial, since much of what is known about microbiome-host immune interactions has come from the study of single bacterial pathogens. Future analyses with larger groups of healthy individuals from different geographical locations will further define the healthy complex microbial diversity of the lungs.
Smoking was associated with weakening bacterial cluster tightness and increased rivalries between clusters, which is more pronounced in the network for the active smokers than in the former smokers, suggesting that smoking plays a role in the lung community dynamics. It is not clear if the changes are due to a direct effect of cigarette smoke, or epiphenomena linked to the anatomical, physiological and immunological alterations associated with smoking. These alterations specifically pertain to immune/inflammatory changes in the host lung due to smoking [64], while numerous byproducts of host inflammation act as bacterial growth factors [57]. On the other hand, cigarette smoke induces inflammation partly via Toll-like receptor 9 (TLR9), which is activated by bacterial DNA [67]. The cycle of inflammation and bacterial growth has been described as central to the development of lung disease [68,69]. For that reason, it has been advocated that therapies that reduce inflammatory parameters need to take into account the presence of persistent colonizers and host-bacterial interactions to be more effective [70].
LEfSe analysis was able to identify biomarkers for each group. Finding microbial biomarkers for smoking, which is a behavioural trait, is important since it may lead to interventions, therapies and treatments of diseases associated with such behaviours.
Although this is one of the largest studies using bronchoscopy samples on stable individuals, it still lacks the power to detect critical bacterial associations or measure smaller but essential differences in community structure. More extensive studies that incorporate whole-genome sequencing, amplicon sequence variants [71] and/or analysis integrating all 16S rRNA gene variable regions may improve the identification of individual species and help overcome these limitations [72]. High-throughput DNA molecular techniques are currently unable to detect bacterial viability, limiting conclusions regarding disease pathogenesis. The present study focused on bacteria, only one component of the total microbiome that encompasses viruses, fungi [73], archaea and other non-fungal eukaryotes, a limitation that can be overcome by alternative techniques such as whole-genome sequencing [74,75]. Finally, this was a cross-sectional study and does not provide information about the long-term composition of the lung microbiome and its relation to disease susceptibility and progression. Incorporating time series analysis [76] and causality [77] is important for future work to increase the limited perspective provided by microbiome snapshots. Causality determination is challenging; however, the findings may help design future prospective studies that lead to a better mechanistic understanding of the smokers' lung microbiome.
In conclusion, using comprehensive integrative analysis tools, we present a better characterization of the complex microbial communities of the LRT and how they change based on the host smoking status. It is clear that significant differences occur in the LRT microbiome due to smoking. It also appears that oral microbiota can seed the lung, especially in smokers, making an upper airway microbiome study also interesting for future work. Based on the PLS-DA analysis, former smoker microbiomes seemed to share properties observed in the lung microbiomes of both active smokers and non-smokers. Replication of these findings, followed by integration with next-generation analytical tools, will likely establish the impact of these microbial clusters on human health.

Conflicts of interest
The authors declare that there are no conflicts of interest.

Ethical statement
The study presented here is a prospective single-centre cohort study approved by the University of Miami's Institutional Review Board (dated 18 March 2010; HSRC study number: 20091147). All sample collections were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by the University of Miami's IRB. Informed consent was obtained from all subjects or, if the subjects were under 18, from a parent and/or legal guardian.

Peer review history
Comments: I thank the authors for addressing all of my concerns, comments, and suggestions and I am happy for this work to be published in it's current form. Below are some comments in response to changes and questions that arose during the review.
-The changes to OTU filtering during the quality control step of 16S rRNA data is a method that I lean towards using in my own work as it increases data quality by removing low abundance/prevalence OTUs while also retaining some biological signals such as the example used by the authors. -Regarding my request for individual p-values, Some journals request that a p-value is provided for each comparison referred to in the text or used to draw conclusions, but this varies between journals and I wasn't sure what the policy was here. It is acceptable to state that these were below a threshold which is logical and clearly stated. -It is unfortunate that a gradient-like relationship or intermediary lung microbiome (for the former smoker samples) wasn't supported by the variables used here. I agree that this would be a very interesting avenue of investigation that would require a lot of work and is beyond the scope here.

Please rate the quality of the presentation and structure of the manuscript Good
To what extent are the conclusions supported by the data? Strongly support Thank you for submitting your paper to Access Microbiology. It has now been reviewed and I would like you to revise the paper in line with the reviewers' reports and any Editorial Office requirements below. The reviewer reports can be found at the bottom of the email.
Editor comments:

Firstly, thank you to both peer reviewers for their time and expertise assessing this manuscript-it is appreciated.
Having read through the reviewers comments there are several questions about the approaches used to analyse the data, questions about the statistic analysis performed, and image quality issues with some of the figures.
Please submit the revised version of your manuscript by 19/11/2022. If you need longer than the suggested time-frame, please contact the Editorial Office in advance to agree a different deadline at acmi@ microbiologysociety. org.
Please note that some revisions are peer reviewed and submitting a revised paper does not guarantee that it will be accepted.
Editorial Office requirements: 1 During submission you requested that your article was posted online immediately without corrections. This means that when you submit your revised article, it will be run through the Review Tools a second time and this second report will be posted alongside the preprint. When submitting your revised article, please therefore ensure you update the SciScore submission question with any revised manuscript text.
Okay, we will do this. We placed an asterisk next to Melita Jaric's name, and indicated her day of passing (3/8/13) in the affiliations.
We do have an Author Contributorship section and followed the guidelines on your website, using the CRediT taxonomy. It was not just above the references, it was two above. We moved it down for clarity. Note that Melita Jaric's (MJ) contributions are listed there.

Submitting a revised paper:
To submit a revision, go to https://www. editorialmanager. com/ acmi/ and log in as an Author. You will see a menu item called 'Submission Needing Revision' . You will find your submission record there. If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Reviewer 1: Yes: Reviewer 1 Comments to Author: This paper describes a 16S rRNA based microbiome analysis of samples collected from the lower respiratory tract of people who have never smoked, are former smokers and are current smokers. The work is an improvement upon previous analyses as it is based on BAL samples, which should limit the amount of contamination from the upper respiratory tract. Overall the paper is well clear and the approach seems sound, but I have a few comments: 1) The authors imply that their study is an improvement upon others as they have more balanced group sizes. However in this paper they had 9 never smokers compared to 24 former and 22 active smokers. The authors need to discuss this in the discussion, and whether they think they had enough data from never smokers to make robust conclusions.

L109: The authors note that the study by Morris et al. is limited because it featured imbalanced group sizes. It is worth noting that the current study also featured an imbalanced design (9 Never Smokers vs. 24 Former Smokers and 22 Active Smokers). If the authors feel that an imbalanced design hinders the validity of results, then I ask them to please explain how their study is not impacted.
Their study had 45 (current) nonsmokers and 19 (current) smokers. Ours has 33 (current) nonsmokers (never+former) and 22 (current) smokers. So in that regard, we were thinking it was more balanced. However since our study was not binary in nature, but tertiary, I don't have an issue removing that criticism. We still go beyond differential abundance and include former smokers, which had not been done in this other study. Changed: The two cohorts in the Morris et al. study had imbalanced sizes and their investigation was confined to surveying the differential abundances of various bacterial taxa. To: The investigation of Morris et al. study was confined to surveying the differential abundances of various bacterial taxa.
2) In the results section the authors often discuss OTUs such as "Ralstonia.03" -these designations are unhelpful for the reader, so a more intuitive way of referring to these OTUs needs to be devised. What species most closely match these OTUs?
Given that we are using a more traditional 16S analysis with OTU clustering (as opposed to a more accurate classification method, like Amplicon Sequence Variants) we are uncomfortable performing a species-level analysis throughout. The 16S gene has much less variation at the species level compared to the genus level. While OTU clustering is routinely >90% accurate at the genus level, that accuracy can easily drop to 60-80% at the species level, and we feel reporting those results could mislead our readers.
As a reference however, we did take the representative sequence of every OTU and ran BLAST against the genomes of its corresponding genus, producing an estimate of the most closely matching species and strain. This is now in a table just before the discussion, and we preface it with the following text: Note as we performed 16S analysis using OTU clustering, we kept all of our analyses at the genus level where this method has been previously illustrated to be most accurate. As a reference, however, we took the representative sequence of every OTU and ran BLAST (45)against the database of genomes from its corresponding genus. Table 8 shows the closest matching species and strain for every one of our taxa.
ASVs could be a future study, and we have now mentioned it more specifically in future work, changing: "More extensive studies with whole-genome sequencing or analysis integrating all 16S rRNA gene variable regions may improve the identification of individual species and help overcome these limitations (59). " To: "More extensive studies that incorporate whole-genome sequencing, Amplicon Sequence Variants (59), and/or analysis integrating all 16S rRNA gene variable regions may improve the identification of individual species and help overcome these limitations (60). " 3) I understand why the bacterial co-occurrence networks were analysed separately in the different cohort groups -but I also think it be more powerful to analyse them as a combined dataset.
We considered your comment, but were unable to see the benefits of doing this and did not see a way to smoothly integrate this network into the writing.
The title of the paper is "Lower respiratory tract microbiome composition and community interactions in smokers". Performing co-occurrence network analysis on the combined dataset would produce one single network for a mixture of individuals who were active smokers, former smokers, and never had smoked. It would thus be challenging, if not impossible, to analyze any impact of smoking on the lower respiratory tract microbiome using that network.
If you disagree, we are willing to produce this network if you really see a clear benefit to our study. If so, please specify why you think this network would be more powerful.

Minor comments:
Line 155 "composing a whopping 75%" -I don't think whopping is appropriate for an academic manuscript Changed 'whopping' to 'overwhelming'

Line 295: please explain the cross-validation errors more -what does this mean?
Changed: "Cross-validation errors were also reported. " To: "M-fold cross-validation (CV) errors were also reported in the legend, indicating the mean-squared error of prediction. "

Fig -6 needs bigger legends
Fig. 6A's legend was the same as Fig. 5, so we added that one to Fig. 6A.
We increased the font size of the legend in Fig. 6B.

Reviewer 2: Satisfactory
Please rate the quality of the presentation and structure of the manuscript The authors present a compelling profiling of the lung microbiome structure in active, former, and never smokers. The analysis methods themselves are reasonably robust but require a few clarifications regarding quality control metrics and statistical methods. I have also suggested a few areas where their observations should be supported by statistical analysis before publication. I look forward to reading the revised manuscript.

Major Issues
Throughout: Lacking appropriate statistical testing to support observations. L55: DESeq2 has been previously shown to have a relatively large false discovery rate compared to other, more conservate, microbiome-tailored approaches (such as ANCOM and ALDEx2) when applied to 16S microbiome data as the underlying statistics are more suited to RNA-Seq gene expression data [https://www. nature. com/ articles/ s41467-022-28034-z]. I would urge the authors to consider reanalysing their data using one of these approaches.
We ran ANCOM, with the bias-correction (ANCOM-BC). The top three taxa distinguishing the Never samples also showed up in this analysis. We indicated this: The three taxa: Streptococcus.03, Fusobacterium.02and Veillonella.03were also discovered by ANCOM with Bias Correction (ANCOM-BC, (30)) as distinguishing for NS (it did not uncover any as distinguishing for FS or AS).

L109: The authors note that the study by Morris et al. is limited because it featured imbalanced group sizes. It is worth noting that the current study also featured an imbalanced design (9 Never Smokers vs. 24 Former Smokers and 22 Active Smokers).
If the authors feel that an imbalanced design hinders the validity of results, then I ask them to please explain how their study is not impacted.
Their study had 45 (current) nonsmokers and 19 (current smokers). Ours has 33 (current) nonsmokers (never+former) and 22 (current) smokers. So in that regard, we were thinking it was more balanced. However since our study was not binary in nature, but tertiary, I don't have an issue removing that criticism. We still go beyond differential abundance and include former smokers, which had not been done in this other study. Changed: The two cohorts in the Morris et al. study had imbalanced sizes and their investigation was confined to surveying the differential abundances of various bacterial taxa. To: The investigation of Morris et al. study was confined to surveying the differential abundances of various bacterial taxa.
L648: It is good practice to remove OTUs from a dataset if they fail a defined inclusion criteria (such as zero in > 50% of samples or < 0.1% mean relative abundance across the whole dataset) but they should always be removed from ALL samples, not just a single group. The decision here to only remove OTUs from a single group is odd and will definitely bias the results by making it appear as though an OTU was not detected at all in that group.
We agree on this bias being present. Taxa appearing as present in one set but not another will impact biomarker (LEfSe and DESeq) and differential (PLS-DA) analysis. Therefore, we have modified these as follows: while we still keep taxa that are in at least 50% of the samples of one of the groups -we now keep them in *all* groups, not just the group where they were found to be in at least 50% of the samples. This creates a uniform set for each of these analyses and should eliminate this bias.
We are reluctant to apply this 50% rule to the entire set, as there is too much risk of removing a taxon that is highly specific to a particular group (which would be one that is very interesting). For example, if a taxon is only present in Active samples, which compose less than 50% of the sample set, it would be removed using this approach. It was for this reason, from the beginning, that we decided to do it this way.
Thus for the rest (bar charts, krona plots, networks), where we are not comparing and contrasting the presence of a taxonin one group vs. another, we still see it beneficial to visualize a representative set of taxa each group, which we define as present in 50% of the samples of that group. We hope this sounds acceptable.
We added the following clarification after the sentence "This means that the set of OTUs retained could be different for each group. "to Methods: As this would create bias in comparative and biomarker analyses (LEfSe, DESeq2, PLSDA), we kept OTUs with non-zero normalized counts in more than 50% of one of the groups in all groups for those analyses, creating a uniform OTU set.
We rewrote the LEfSe, DESeq2 and PLS-DA sections with the new results.
Updated the figure caption as well.
L78: I suggest clarifying that these marked differences are from previous studies and don't refer to the results presented here.
Agreed, added "in previous studies" to the end of the sentence.
L84: My understanding of the LEfSe is that the tool identifies biomarkers or differentially abundant features. It does not perform a classification step -that would need to be performed using Random Forest classification (or similar) with those differentially abundant features as the predictor variables. Please clarify that this study just identified biomarkers with the potential to classify smokers vs. non-smokers (I don't think they did the classification step, based on a quick read of the paper).
This was actually another study, not ours. The point was that although LEfSe is a tool for identifying biomarkers, because the signatures were so well-defined in this salivary microbiome study, the biomarkers themselves could be used for classification.
To make this clear, changed the sentence from: The salivary microbial signature at the genera level is so well differentiated using Linear Discriminant Analysis Effect Size (LEfSe) that the tool can be used to classify smokers and nonsmokers [8]. To: The salivary microbial signature at the genera level is in fact so well differentiated that biomarkers discovered using Linear Discriminant Analysis Effect Size (LEfSe) can be used to classify smokers and nonsmokers [8].
L87: As this is based on Weighted UniFrac distance, this suggests to me that the more abundant members of the tongue microbiome return to a 'healthy' state. Did the authors of that paper report a difference in unweighted unifrac between former and never smokers? If so, then it might be that the rare microbiome is still disrupted and could be investigated here.
No, the results were similar with unweighted UniFrac and Bray-Curtis as well.
Changed: "weighted UniFrac distance" to: "weighted UniFrac, unweighted UniFrac, and Bray-Curtis distances" L144: This is a relatively large proportion of unclassified reads. What cutoffs did the authors use to determine whether a read was unclassified? Additionally, in relation to the methods section, were the OTUs classified using the RDP classifier tool or BLAST alignment against the RDP database? What was the cutoff used to label a read "unclassified"? (eg. confidence or e-value cutoff).
Regarding the cutoffs, this information is in the methods. For clarity, we also provide it in the area you specify (line 144): (k-mer size=7, confidence cutoff=60, cluster sequence similarity distance cutoff=0.03%) The RDP was actually a mistake, so we are glad you pointed this out. We had begun by using that database, but wound up using SILVA in the end. Changed: "and classified against the RDP Database (ver. 9)" to: "and classified against the SILVA reference database (ver. 132)" using L165: Please clarify whether "dominance" is subjective or based on mean relative abundance across the dataset.
Added (based on mean relative abundance) as a qualifier in parentheses.

L165
: Please state what the whiskers represent? I would normally expect the whiskers to represent the min and max value, but outlier points are also included. Do they represent the min and max with outliers excluded?
That is correct; these are outlier points. Changed: " Fig. 2 shows box and whisker plots summarizing the relative abundance of the four most abundant phyla in all the samples. " To: " Fig. 2  Yes, the taxa mentioned in the following paragraph are all reported as differentially abundant by DESeq2. Are you asking for a p-value for each one? Note that our first sentence mentions the p-value threshold, which by extension means that all taxa that were found as differentially abundant have p-values less than 0.05, which we are assuming to be biologically significant: Differentially abundant taxa were identified using DESeq2 with a p-value threshold of 0.05.
Does this address your first concern?
Regarding your second, yes, the p-value was corrected for multiple comparisons using the Benjamini-Hochberg method. To make this clear, we added the sentence: P-values were corrected for multiple comparisons using the Benjamini-Hochberg method (27).
L184: Labels are low resolution and difficult to read. These plots are often more suited to interactive exploration on a computer than static in a manuscript. Please increase resolution or use an alternative method of plotting.
We regenerated the images at a higher resolution.
L190: I suggest clarifying that this refers to taxa that are more abundant in one group compared to one OR both other groups. Currently, this reads like this analysis looking for genera which are more abundant in one group compared to BOTH groups (combined together). Similar to how LEFSe identifies biomarkers in strict mode.
We agree this sentence was not worded properly. Changed: "Using DESeq2 [28]we conducted a pairwise comparison of individual OTUs to determine OTUs found to be significantly more abundant in each subject group over the other two groups(p<0.05), " To: "Using DESeq2 [28]we conducted a pairwise comparison of individual OTUs to determine OTUs found to be significantly more abundant in each subject group over one or both othergroups(p<0.05), " L191: Please specify how these quartiles are calculated? Is it across the entire dataset or within the specified group? Also please specify that the quartiles refer to relative abundance, not just non-normalised counts per taxa.
They are within the specified group. Changed: "summarized in Table 3, ordered by their median quartile after performing interquartile range calculations on each sample in the column. Each column is labeled by a cohort (Never, Former and Active), and contains a list of the taxa that are differentially abundant in that cohort " To: "summarized in Table 3. Each column is labeled by a cohort (Never, Former and Active), and contains a list of the taxa that are differentially abundant in that cohort.Interquartile range calculations are performed for each sample based on taxa relative abundance, and taxa are subsequently ordered by their median quartile over all samples in the corresponding group." L193: I suggest keeping the group labels consistent throughout. In the text the groups are referred to as NS, FS, and AS but the S is dropped in this table.
We agree. We changed the text from: ". The cohort that was used for the comparison is provided in parentheses (N=Never, F=Former, A=Active);" To: ". The cohort that was used for the comparison is provided in parentheses (NS=Never, FS=Former, AS=Active);" For further clarity, we also modified the column headers in Table 3  L195: I suggest repeating that you are only listing taxa which are more abundant in that particular group. Changed: "The cohort that was used for the comparison is provided in parentheses (N=Never, F=Former, A=Active); taxa differentially abundant with respect to both the remaining groups do not have any cohorts mentioned in parentheses and are marked in bold. " To: "For each column, taxa differentially abundant with respect to one other group have the second group provided in parentheses (NS=Never, FS=Former, AS=Active); taxa differentially abundant with respect to both the remaining groups do not have any cohorts mentioned in parentheses and are marked in bold. " L202: Were any tests for differential abundance performed at phylum level? The feature count table could be collapsed to Phylum rank based on OTU taxonomy and inputted into DESeq2.
We ran DESeq2 at the phylum level for all three pairs of categories (same p-value threshold of 0.05). Only Active and Never turned up distinguishing taxa. We added the following sentence after our initial reference to Proteobacteria: "A subsequent run of DESeq2 at the phylum level did turn up Proteobacteria as distinguishing AS and NS (but interestingly, not FS and NS), along with Acidobacteria. " We computed Faith's Phylogenetic Diversity index among the three groups, and also a higher phylogeneic diversity was observed for Never smokers. This is now our new Fig. 4D, and we added: Faith's Phylogenetic Diversity index (33)also showed a higher value within errors (ANOVA, p=2e-8)for Never smokers compared to Active and Former (Fig. 4D).
L220: Did the authors attempt to quantilfy this correlation using pearson/spearman etc? It would be informative to see if there was a high degree of correlation between the two metrics across the entire dataset and whether there was a higher correlation within any of the smoking groups compared to the others.
We added a Table (Table 4) containing Pearson correlations for richness and diversity, across all samples, and in the three sample sets.
Added the following: " Table 4 shows Pearson correlations between richness and diversity along with associated p-values across all samples, and within each group. Interestingly, the Never smokers showed the highest (0.75) and most statistically significant (p=0.02) correlation between richness and diversity, with the Former smokers observing a weaker correlation (0.4) that is still biologically significant (p=0.05). The Active smokers experienced almost no correlation at all (0.09, p=0.69). Across the entire set, a statistically significant (p=3e-7) positive correlation of 0.63 was observed between richness and diversity. " L224: Panels C and D are good and show the distribution nicely, but the axes labels and titles are too small and difficult to read. Please rectify. The panels are also quite redundant as the same information is presented in both (with the information switched between the black line and bars). I would suggest sticking to one version of the graph and increasing size to make it more readable.
We decided to keep Figure 4C, we made the axes labels and titles larger, and saved as a high-quality PNG. Also removed the sentence: When the subjects were ordered by decreasing diversity, we can primarily observe the richness spiking with the smoking status,as opposed to anyclear correlation between richness or smoking status and diversity (Fig. 4D).
L226: Please do not use the word "significant" unless a statistical test was used. If it was, please state the test used and p-value.
We are fine with this and removed this line: "Significant differences are particularly notable between Never smokers and the other two groups. " from the caption; as it did not seem an appropriate place to make a visual observation anyway.

L233: Was this with respect to one, or both, of the other groups?
This entire paragraph was redundant and should not have been there (thank you): " Table 3 summarizes the genera that were found to be significantly more abundant in each of the study groups using DESeq2 (29). The first column indicates the quartile to which the particular bacterial taxon belongs to in its cohort. The quartiles are numbered 1 through 4, where 1 is the highest quartile, and the median divides quartiles 2 and 3. " We removed it. "Five of these --Prevotella, Veillonella, Haemophilus, Lautropia, and Methylobacteriumwere also uncovered by DESeq2 as distinguishing the Never samples. " Figure 5: I think the y axis is mislabelled, or else LOD score, and its relationship to LDA, needs to be explained.
You are correct, this should have been LDA score. Fixed.
L276: Was there any difference in weighted unifrac distance like in the study mentioned in the introduction? Differences in unweighted unifrac but not weighted would support a claim that rare members of the microbiome are more important in differentiating. Also, what is the mean sum RA of the taxa identified by LEfSe in each group? Are they rarer members of the microbiome?
We now include weighted Unifrac as our new Fig. 6C, and indeed it does not show the difference that we see in unweighted Unifrac. We add the following (including the reference to our LEfSe results): When we move to weighted Unifrac (Fig. 6C), the separation is not as clear, indicating that rarer members of the microbiome may potentially be more distinguishing between the three sample sets. Collectively, the genera reported by LEfSe for distinguishing NS samples (Prevotella, Veillonella, Haemophilus, Mogibacterium, Eubacterium, Lautropia, Megasphaera, and Methylobacterium) compose on average 19.1% of the relative abundance of those samples, offering some additional support to this claim. Figure 6: It looks as though there is a higher variance within the FS and AS groups. Can you quantify this? eg. using betadisper in R (or equivalent). It could be possible that there isn't a strict "smoking microbiome" but instead it is disrupted and more heterogenous than the "healthy" lung microbiome.
The AS group in particular did experience higher variance, which we confirmed with BetaDisper. Added: " Fig. 6B supports that the NS group was more differentiated compared to the other two groups based on unweighted Unifrac Distance, although more variance can be seen in the AS group, which was confirmed by BetaDisper (37), with an average distance-to-median of 0.36 compared to 0.29 for FS and NS. " Ran PERMANOVA and confirmed (p=0.017). Added the sentence: This separation was confirmed by PERMANOVA (37), with a p-value of 0.017. L290: The use of "good" here is unscientific please consider rephrasing as "clear" or "distinct". Or perform a statistical test to support the separation.
L293: again, "fair bit of separation" is not very scientific.
Removed "a fair bit of ". : it appears to be as though the FS group has an "intermediary" microbiome structure, falling partway between the two other groups. Is it possible for the authors to include variables like "pack years" or "time since quitting" into their analysis to determine whether either of these variables has a relationship with where the FS individual falls within the group? Eg. do individuals that smoked more, or quit more recently, have a micorbiome more similar to active smokers? Does time since quitting determine how similar the microbiome is to either group?
This is an interesting observation, and we were excited to pursue this.
However after doing so, we were unable to establish any clear relationship between either of these variables and (1) the position of a sample in the PLS-DA analysis or (2) the abundances of an group of taxa in one of the rival clubs (your later point below).
There thus appears to be more variables here than meets the eye, warranting a much more in-depth analysis that could probably produce another publication. Given this, we hope it is acceptable to publish what we have and use it as motivation for future work. We add the following sentence to our future work section: In particular, Fig. 7 showed that the former smokers appeared to occupy an "intermediate" state that showed overlap with both the active and never smokers, which illustrates that a more granular study for former smokers that accounts for independent variables such as the amount of time since quitting smoking, pack years, etc. may be relevant for future work.

L312: Where no archaea identified? 16S amplicon metagenomics should identify archaea in addition to bacteria, if present.
We removed them, in addition to chloroplast, mitochondrial and eukaryotic sequences. We realized we left them out in the methods when we mentioned the types of sequences we removed. Changed: removed as were all eukaryotes, mitochondria, and chloroplasts.
To: removed as were all eukaryotes, mitochondria, archaea, and chloroplasts.

L339: What correlation metric is used and is it statistically tested?
Added (SparCC, p=0.05) after "correlation" (note this is now in the Figure caption, per your previous suggestion) Figure 9. Text is too small to be easily readable, or else the figure resolution is too low. Please recify.
Separated the figures and saved them as higher resolution.
L390: I would like to see an attempt to combine the results from co-occurance analysis with the DESeq2 and LEFsE results.

Added the following:
Ralstoniawas earlier reported in our Krona charts as much more abundant in Former and Active groups compared to Never, and all three Ralstoniataxa were reported by DESeq2 as distinguishing Former and Active samples compared to Never. Here, we see a nearly identical Ralstonia-heavy cluster of taxa forming in the Former and Active networks, in both cases rivaling a cluster consisting of many previously reported healthy oral and lung microbes, including Streptococcus, Veillonella, Prevotella, Haemophilus, and Fusobacterium. All five of these taxa were reported by DESeq as distinguishing the Never group (three by LEfSe as well), which leads to questions regarding the potential impact of Ralstoniain the Former and Active microbial ecosystems.
L396: Contamination can be statistically accounted for by the use of appropriate use of negative controls and decontamination of OTU tables using methods such as decontam in R (seehttps://www. mdpi. com/ 2076-2607/ 9/ 3/ 492).
We agree, although our point was not so much that we were concerned about contamination, only that oropharyngeal species would still be discovered in BAL despite potential for contamination. To avoid this confusion, we are fine removing the phrase: "Although some contamination during bronchoscopic sampling cannot be eliminated, the" L409: Are these more or less abundant in non-smokers? Also, I assume nonsmokers refers to Never Smokers?
We used the term "non-smokers" because we were comparing our "never smokers" to Morris' nonsmokers (= never + former).
Upon further inspection we removed Enterobacteriaceae, as it was found to be disproportionately abundant in the lung vs. the mouth, but not necessarily for non-smokers.
L448: Does "single subject" refer to a single group as opposed to an individual person/sample? Please clarify Yes; changed "that avoided each other in a single subject" to "tend to co-avoid".
L453: This idea of the core lung microbiome is first presented in the discussion. It should first be mentioned in the results section.
Moved the sentence: "One important observation is that a small community of co-occurring bacteria were found to be common among all three groups. Possible explanations of this phenomenon could that these bacteria may be part of a "core" microbiome playing an important functional role in the lung. " to the end of results.
Changed the beginning of the sentence in Discussion to smooth over: "In our co-occurrence networks we observe a potential "core" group consisting of the following taxa: two OTUs from the genus Propionibacterium, and OTUs from Delftia, Pseudomonas, and Stenotrophomonas. Possible explanations for this common cluster could be that these microbes play some housekeeping role in the lung or are involved in some processes and pathways unaffected by smoking. " L469: Using PERMANOVA with independent variables such as Pack Years or Time Since Quitting would allow the authors to quantify these subtypes in their own dataset. See our comment above (L293).

L503: If present, archaea would be detected by 16S profiling.
See our response to L312; we removed them and added a clarification to methods indicating this.
L616: what Shotgun libraries are being referred to here? This is an amplicon metagenomics study.
Changed "A shotgun library sample" to "A library sample".

L630-633: Can the authors please describe what the alignment and reverse complement step achieves? I have not seen this before.
If the alignment step removes more than 50% of the reads we assume this could be an indicator that the reads are flipped. Added a qualifier ", indicating the reads may have been flipped" to the end of the sentence. L655: Are "distinguishing" taxa those which are more abundant?
Yes. Added the qualifier (based on abundance) after distinguishing.
L659: Was LEfSe run in a pairwise manner or were all three groups considered together? If the latter, was a taxon determined to be discriminatory if it was significantly enriched in one group compared to the both other groups or just one?
All three together, and the taxon would need to be significantly enriched compared to both other groups. Changed the phrase: "between the two groups" to "between each group and the other two. "

L675: Please provide some info on how SparCC works (just the coefficient calculated and whether it was statistically tested)
Added the qualifier (p=0.05) after SparCC.
Added the sentence: SparCC iteratively estimates correlation using log-ratio transformations of relative abundances, and operates under the assumption that microbial data is sparse.

Suggestions/Miscellaneous
L56-58: I suggest having this as the final sentence of the abstract as it is more of a conclusion and wraps up the story of the manuscript. "The groups were sufficiently distinct from each other, suggesting that cessation of smoking may not be sufficient for the lung microbiota to return to a composition similar to that of never smokers." Added your suggested sentence.

L112: The authors could confirm this is downloading the data of Morris et al. and reanalysing using their methods to determine if they can detect a difference in microbiome structure.
We actually cannot using the dataset of Morris et al, because they did not include former smokers. However, I do not have a problem and removed the sentence: "Different analytical approaches may disclose more subtle differences in the airway microbiome about current or former smoking. " L213: The use of "increase in" here suggests an intervention of some kind. Eg. individuals stopped smoking and their microbiome diversity increased. Please consider rephrasing to "markedly higher" or similar.
Rephrased to "markedly higher", as per your suggestion. com/ articles/ s41467-022-28034-z]. I would urge the authors to consider reanalysing their data using one of these approaches. L109: The authors note that the study by Morris et al. is limited because it featured imbalanced group sizes. It is worth noting that the current study also featured an imbalanced design (

Never Smokers vs. 24 Former Smokers and 22 Active Smokers).
If the authors feel that an imbalanced design hinders the validity of results, then I ask them to please explain how their study is not impacted. L648: It is good practice to remove OTUs from a dataset if they fail a defined inclusion criteria (such as zero in > 50% of samples or

Anonymous.
Date report received: 10 October 2022 Recommendation: Minor Amendment Comments: This paper describes a 16S rRNA based microbiome analysis of samples collected from the lower respiratory tract of people who have never smoked, are former smokers and are current smokers. The work is an improvement upon previous analyses as it is based on BAL samples, which should limit the amount of contamination from the upper respiratory tract. Overall the paper is well clear and the approach seems sound, but I have a few comments: 1) The authors imply that their study is an improvement upon others as they have more balanced group sizes. However in this paper they had 9 never smokers compared to 24 former and 22 active smokers. The authors need to discuss this in the discussion, and whether they think they had enough data from never smokers to make robust conclusions. 2) In the results section the authors often discuss OTUs such as "Ralstonia.03" -these designations are unhelpful for the reader, so a more intuitive way of referring to these OTUs needs to be devised. What species most closely match these OTUs? 3) I understand why the bacterial co-occurrence networks were analysed separately in the different cohort groups -but I also think it be more powerful to analyse them as a combined dataset Minor comments: Line 155 "composing a whopping 75%" -I don't think whopping is appropriate for an academic manuscript Line 295: please explain the cross-validation errors more -what does this mean? Fig -6 needs bigger legends Line 553 -DNA itself isn't metagenomic, the approach is. Please remove "metagenomic" Please rate the manuscript for methodological rigour Very good

Please rate the quality of the presentation and structure of the manuscript Satisfactory
To what extent are the conclusions supported by the data? Strongly support